Last class we got the data into an almost clean-enough format to work with. We quickly rerun the commands that bring us back to the point where we left off.
load("atusresp.rda")
load("atussum.rda")
load("atuscps.rda")
social_cols <- names(atussum)
social_cols <- grep("t12", social_cols)
social_times <- rowSums(atussum[,social_cols])
x <- atussum[,c(1, 3:8)] # we like these columns!
x$social_time <- social_times
keep <- c("tucaseid", "num_children", "weekly_earn", "diary_date")
y <- atusresp[, keep]
x <- merge(x, y, by="tucaseid", all.x = TRUE)
keep <- c("tucaseid", "age", "sex", "famincome", "hh_size") # use age to check
y <- atuscps[atuscps$atus == 1,keep]
z <- merge(x, y,by="tucaseid", all.x=TRUE)
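As a reminder of what `merge(..., all.x = TRUE)` does, here is a minimal sketch on toy data (the data frames `left` and `right` and their values are made up, not ATUS). It shows why a column present in both inputs comes back as `.x` and `.y`, and that unmatched rows from the first data frame are kept with `NA`.

```r
# Toy data, not ATUS: 'left' plays the role of x, 'right' the role of y.
left  <- data.frame(id = c(1, 2, 3), age = c(30, 41, 52))
right <- data.frame(id = c(1, 2),    age = c(30, 42), hh_size = c(2, 4))

m <- merge(left, right, by = "id", all.x = TRUE)
names(m)   # "id" "age.x" "age.y" "hh_size"
m$age.y    # 30 42 NA: id 3 has no match in 'right', so it is kept with NA
nrow(m)    # 3; without all.x = TRUE the unmatched row would be dropped
```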
We were pondering the strange differences in reported age between ATUS (age.x) and CPS (age.y) for the same individuals.
table(z$age.x-z$age.y)
##
##   -44  -39  -38  -36  -33  -32  -31  -30  -29  -28  -27  -24  -22  -21  -20
##     1    2    1    1    2    1    2    2    1    1    1    3    1    1    1
##   -19  -18  -17  -16  -15  -14  -13  -12  -11  -10   -9   -8   -7   -6   -5
##     1    1    2    5    1    2    1    2    2    7    5    3    3    6    3
##    -4   -3   -2   -1    0    1    2    3    4    5    6    7    8    9   10
##     6    6    8   48 9436 1914   21    7    8    5    3    2    3    5    5
##    11   12   13   14   15   16   18   21   22   24   26   27   28   29   30
##     4    3    3    2    3    2    3    4    2    1    2    2    3    2    4
##    31   33   37   39   40   44   48   49   57   64
##     1    1    1    1    1    1    1    1    1    1
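One thing to keep in mind when reading a table like this: `table()` leaves out `NA` by default, so any individuals with a missing age in one of the sources would not show up in the counts. A small sketch with made-up ages (the vectors `age_atus` and `age_cps` are hypothetical) shows how `useNA` makes them visible.

```r
# Toy age vectors (hypothetical values); NA stands for a missing CPS age.
age_atus <- c(30, 41, 52, 25, 67)
age_cps  <- c(30, 40, NA, 25, 23)
d <- age_atus - age_cps

table(d)                    # counts for 0, 1 and 44 only; the NA is not shown
table(d, useNA = "ifany")   # same counts plus an <NA> column
sum(is.na(d))               # 1
```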
Our first idea was to toss out all individuals with a discrepancy greater than 1 in absolute value:
# z <- z[abs(z$age.x-z$age.y) <= 1,]
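This time, remove the -1’s as well and keep only differences of 0 or 1 (ATUS is collected after CPS, so a reported age should not go down):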
agediffs <- z$age.x - z$age.y
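z <- z[agediffs == 0 | agediffs == 1,]
# Alternative to the line above; same result when agediffs has no NAs.
# Run one or the other, not both: agediffs was computed before filtering,
# so it no longer lines up with the rows of z afterwards.
# z <- z[agediffs %in% 0:1, ]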
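The `==` and `%in%` forms differ in how they treat missing values, which matters if some age differences are `NA`. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data frame (hypothetical values); one age difference is NA.
toy <- data.frame(id = 1:5, d = c(0, 1, NA, -1, 7))

by_eq <- toy[toy$d == 0 | toy$d == 1, ]   # NA in the index yields an all-NA row
by_in <- toy[toy$d %in% 0:1, ]            # NA counts as "not in the set"

nrow(by_eq)   # 3: ids 1 and 2 plus one all-NA row
nrow(by_in)   # 2: ids 1 and 2 only
```

If all-NA rows are not wanted, `%in%` (or wrapping the condition in `which()`) is the safer choice.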
Check the ‘sex’ variables:
table(z$sex.x, z$sex.y)
##
##        M    F
##   M 4928    3
##   F    2 6186
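The off-diagonal cells are the individuals whose recorded sex disagrees between the two sources. If we wanted to look at those rows, something along these lines would work; this is a sketch on made-up data (`toy` is hypothetical), and it assumes the two columns are character vectors (factors with different level sets would need `as.character()` first).

```r
# Toy data (hypothetical values) standing in for z$sex.x and z$sex.y.
toy <- data.frame(id = 1:5,
                  sex.x = c("M", "F", "F", "M", "F"),
                  sex.y = c("M", "F", "M", "M", "F"),
                  stringsAsFactors = FALSE)

table(toy$sex.x, toy$sex.y)        # off-diagonal cells are disagreements
mismatch <- toy$sex.x != toy$sex.y
sum(mismatch)                      # 1
toy[mismatch, ]                    # the row with id 3
```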
The entire set of variables:
names(z)
## [1] "tucaseid" "age.x" "sex.x" "edu"
## [5] "race" "hispanic" "metro" "social_time"
## [9] "num_children" "weekly_earn" "diary_date" "age.y"
## [13] "sex.y" "famincome" "hh_size"
Tossing out the duplicate variables:
all_cols <- names(z)
z <- z[, !(all_cols %in% c("age.y", "sex.y"))]
names(z)
## [1] "tucaseid" "age.x" "sex.x" "edu"
## [5] "race" "hispanic" "metro" "social_time"
## [9] "num_children" "weekly_earn" "diary_date" "famincome"
## [13] "hh_size"
Let’s clean up the ‘.x’ column names:
gsub("\\.x", "", names(z))
## [1] "tucaseid" "age" "sex" "edu"
## [5] "race" "hispanic" "metro" "social_time"
## [9] "num_children" "weekly_earn" "diary_date" "famincome"
## [13] "hh_size"
names(z) <- gsub("\\.x", "", names(z))
names(z)
## [1] "tucaseid" "age" "sex" "edu"
## [5] "race" "hispanic" "metro" "social_time"
## [9] "num_children" "weekly_earn" "diary_date" "famincome"
## [13] "hh_size"
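A side note on the pattern: `"\\.x"` matches `.x` anywhere in a name, not just at the end. For the names here that makes no difference, but anchoring with `$` is a bit safer in general. A small sketch, where the name `max.xval` is made up purely to show the pitfall:

```r
# Hypothetical column names; "max.xval" is invented for illustration.
nms <- c("tucaseid", "age.x", "sex.x", "max.xval")

gsub("\\.x", "", nms)   # "tucaseid" "age" "sex" "maxval"    (last name mangled)
sub("\\.x$", "", nms)   # "tucaseid" "age" "sex" "max.xval"  (suffix only)
```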
Save this file:
write.csv(z, "atus_social.csv", row.names = FALSE)
# x <- read.csv("atus_social.csv", as.is = TRUE)
# names(x)
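To see what `row.names = FALSE` buys us, here is a round trip on a toy data frame written to a temporary file (the data and the path are made up, so nothing here touches `atus_social.csv`).

```r
# Toy data frame (hypothetical values) and a throwaway file.
toy <- data.frame(sex = c("M", "F"), social_time = c(120, 0))
path <- tempfile(fileext = ".csv")

write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, as.is = TRUE)
names(back)       # "sex" "social_time"
class(back$sex)   # "character": as.is = TRUE prevents conversion to factor

write.csv(toy, path)                  # default row.names = TRUE
names(read.csv(path, as.is = TRUE))   # "X" "sex" "social_time"
```

With the default, the row names are written as an unnamed first column, which comes back as an extra column called `X`.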